(57) Abstract

A nucleic acid fragment comprising a portion of at
least 17 contiguous nucleotide bases, which portion has a
sequence the same as, or homologous to, a portion of
corresponding length of the sequence of the coding strand as set
out in Fig. 1, or the same as, or homologous to, a portion of
corresponding length of the sequence complementary to the
sequence of the coding strand set out in Fig. 1.

MUCIN NUCLEOTIDES

The present invention relates to nucleotide
fragments, polypeptides and antibodies and their use in
medical treatment and diagnosis.

In International Patent Application no. WO-A-88/05054
there is disclosed a tandem repeat sequence contained in
the human polymorphic epithelial mucin (HPEM) gene and
nucleotide probes, polypeptides, antibodies and
antibody-producing cells which are useful in the diagnosis
and treatment of adenocarcinomas such as breast cancer.

The present inventors have now elucidated the
nucleotide base sequence of the gene in the region 5' of
the tandem repeat sequence (unless the context implies
otherwise, directions such as "5'" or "3'", "upstream" or
"downstream" used herein refer to the non-template strand
of the genomic DNA or fragments thereof). The complete
sequence of the 1763 nucleotide bases of the non-template
strand upstream of and including the first SmaI
restriction site in the tandem repeat is set out in Fig.
1. The sequence of 1575 nucleotide bases of the non-template
strand upstream of and including the first SmaI
restriction site in the tandem repeat as set out in Fig. 3
has been extended and some parts have been corrected in
the light of repeat experiments. The template strand has
a complementary sequence and it is this strand which is
transcribed into RNA during expression of the gene
product.

In addition to conventional transcriptional and
translational start sites and intron splicing sites, this
sequence contains a number of features which may be
important in the diagnosis and therapy of cancers and in
expression of proteins from recombinant vectors. These
features will be described below. The amino acid sequence
corresponding to the translated portions of this
nucleotide sequence gives rise to peptides and thence to
antibodies and antibody-producing cells which may also be
useful in such diagnosis and treatment.

In one aspect the present invention provides a
nucleic acid fragment comprising a portion of at least 17
contiguous nucleotide bases, which portion has a sequence
the same as, or homologous to, a portion of corresponding
length of the sequence of the coding strand as set out in
Fig. 1, or the same as, or homologous to, a portion of
corresponding length of the sequence complementary to the
sequence of the coding strand set out in Fig. 1.

As used herein the term "fragment" is intended to 
include restriction endonuclease-generated nucleic acid 
molecules and synthetic oligonucleotides. 

The nucleic acid fragments of the invention may be
single-stranded or double-stranded and they may be RNA or
DNA fragments. Single-stranded fragments may be "plus" or
coding strands having the sequence of Fig. 1 or a part
thereof or a sequence homologous thereto. Alternatively
the single-stranded fragments may be "minus" or non-coding
strands having a sequence complementary to the sequence of
Fig. 1 or a part thereof or a sequence homologous thereto.
Double-stranded fragments contain a complementary pair of
strands (i.e. one plus strand and one minus strand).

RNA fragments according to the invention will, of
course, contain uridylic acid ("U") residues in place of
the deoxythymidylic acid residues ("T") of the coding
(non-template) strand set out in Fig. 1 or, if
complementary to the sequence of the coding strand, they
will contain U residues in positions complementary to the
adenylic acid ("A") residues in the coding strand set out
in Fig. 1.
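
The strand relationships described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical and the six-base input is an arbitrary example, not a portion of Fig. 1.

```python
# Minimal sketch of the plus/minus strand and DNA/RNA relationships.
# All names here are illustrative assumptions, not part of the disclosure.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the minus (non-coding) strand, written 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def coding_to_rna(dna):
    """RNA with the same sequence as the coding strand: U replaces T."""
    return dna.upper().replace("T", "U")


def complementary_rna(dna):
    """RNA complementary to the coding strand: U pairs with each A."""
    return coding_to_rna(reverse_complement(dna))


if __name__ == "__main__":
    plus = "ATGACA"  # arbitrary example, not taken from Fig. 1
    print(reverse_complement(plus))
    print(coding_to_rna(plus))
    print(complementary_rna(plus))
```

In this sketch the complementary RNA carries a U at every position that pairs with an A of the coding strand, as the paragraph above states.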

Preferably the nucleic acid fragments of the
invention are double-stranded DNA fragments.
Single-stranded nucleic acid fragments of the invention
are at least 17 nucleotide bases in length.
Double-stranded nucleic acid fragments of the invention
are at least 17 nucleotide base pairs in length.

Preferably the fragments are at least 20 bases or base
pairs in length, more preferably at least 25 bases or base
pairs and yet more preferably at least 50 bases or base
pairs in length.

Statistically it is almost certain that a 17
nucleotide base sequence will be unique so that any
nucleic acid fragment having a contiguous portion of 17
nucleotides of a sequence identical to a portion of
corresponding length of the coding strand as set out in
Fig. 1, or the same as the non-coding strand complementary
to the sequence of Fig. 1, will be new. Fragments
according to the invention which are only 17 nucleotides
or nucleotide bases in length have a sequence the same as,
or complementary to, that set out in Fig. 1. Longer
fragments of the invention may have a sequence which is
homologous to a corresponding portion of the sequence for
the coding strand as set out in Fig. 1 or to the
complementary non-coding strand.
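
The statistical argument can be made concrete with a rough calculation. The sketch below assumes a uniform random base model (which real genomes only approximate) and a genome length of about 3 x 10^9 base pairs; both are assumptions introduced for illustration and are not figures taken from this specification.

```python
# Rough estimate of how often a given k-mer is expected to occur by chance.
# ASSUMED_GENOME_LENGTH and the uniform base model are illustrative assumptions.

ASSUMED_GENOME_LENGTH = 3_000_000_000  # approximate haploid human genome, in bp


def distinct_kmers(k):
    """Number of distinct sequences of length k over A, C, G and T."""
    return 4 ** k


def expected_chance_matches(k, genome_length):
    """Expected chance occurrences of one given k-mer, counting both strands."""
    positions = 2 * max(genome_length - k + 1, 0)
    return positions / distinct_kmers(k)


if __name__ == "__main__":
    for k in (17, 20, 25):
        print(k, distinct_kmers(k), expected_chance_matches(k, ASSUMED_GENOME_LENGTH))
```

Under these assumptions the expected number of chance occurrences of a given 17-mer is below one, and it falls by a factor of four for every additional base, which is consistent with the preference for longer fragments.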

Preferably nucleic acid fragments according to the
invention have at least 75% sequence homology with a
corresponding portion of the sequence of Fig. 1 or the
complementary non-coding strand, for instance 80 or 85%,
more preferably 90 or even 95% homology. Differences may
arise through deletions, insertions or substitutions.
In addition to containing a portion homologous to or the
same as the sequence of the coding strand in Fig. 1 or
complementary non-coding strand, the nucleic acid
fragments of the invention may include sequences
completely unrelated to that in Fig. 1.
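
One simple way to put a number on such homology is percentage identity over an already aligned pair of sequences. The sketch below is an assumption about how the percentage might be computed, not the method used by the inventors; the function names are hypothetical, gaps are written as "-" and counted as differences.

```python
# Percent identity between two pre-aligned sequences of equal length.
# Gap characters ("-") mark deletions or insertions and count as differences.


def percent_identity(aligned_a, aligned_b):
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    if not aligned_a:
        raise ValueError("sequences must not be empty")
    pairs = zip(aligned_a.upper(), aligned_b.upper())
    matches = sum(1 for x, y in pairs if x == y and x != "-")
    return 100.0 * matches / len(aligned_a)


def meets_threshold(aligned_a, aligned_b, threshold=75.0):
    """True if the pair reaches the given homology threshold (default 75%)."""
    return percent_identity(aligned_a, aligned_b) >= threshold


if __name__ == "__main__":
    # Arbitrary example sequences, not taken from Fig. 1.
    print(percent_identity("ACGTACGT", "ACGTACGA"))
    print(meets_threshold("ACGTACGT", "ACGTTTTT"))
```

A substitution and a single-base gap are treated alike here; a different scoring scheme could be substituted without changing the overall approach.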

Particular features of interest within the coding
strand in Fig. 1 are set out in Tables 1 to 3 below:

TABLE 1: Signal Sequences

Location*       Sequence in PEM         Significance

1-2             CG                      transcriptional start site
73-75           ATG                     translational start signal
131-132         GT                      start of first intron
631-632         AG                      end of first intron
100-130 and     TTCCTGCTGCTGCT-         Signal sequence, interrupted
633-637         CCTCACAGTGCTTA-         by first intron (first intron
                CAG...TTGTT             indicated by "...")
955-960         CCCGGG                  SmaI site at start of tandem
                                        repeat

Footnotes to Tables 1 and 2

+ In the consensus sequences:
    R is A or G
    N is A, C, G or T
    W is A or T
    X is
    Y is C or T

* Locations are of the 5' base of the indicated PEM
  sequence numbered as in Fig. 1.

TABLE 2: Regulatory elements within the 5'-flanking sequence

Regulatory element            Consensus sequence+     Sequence in PEM   Location*

SP1                           GGGCGG                  GGGCGG            -727
                                                      GGGCGG            -397
                                                      GGGCGG            -94
                                                      GGGCGGGCGGGCGGG   -54

SV40 enhancer element
  a                           ATGTGTGT                CTGTGGGT          -562
  b                           GCATGCAT                GCCTGCCT          +25
  c                           GTGGATAG                GTGGAGAG          -702

AP-1                          CTGACTCA                GTGACCAC          -739
                              (alternatives G, A)     CTGCTTCA          -418
                                                      GTGCCTAG          -61
                                                      CTGCCTGA          +27

AP-2                          CCCCAGGC                ACCCAGGC          -597
                              (alternatives G, G)     CACCGGGC          +77

NF1/CTF                       TTGGCTNNNAGCCAA         TTGGCTTTCTCCAA    -618

Glucocorticoid regulatory element:
  Core sequence               TGTTCT                  TGTTCT            +38
                                                      TGTTCC            -321
  Consensus sequence          GGTACANNNTGTTCT         GCCTGAATCTGTTCT   +29
                                                      AGCTGGCTTTGTTCC   -330

CACCC factor                  CACCC                   CACCC             +54
                                                      CACCC             +84

Progesterone receptor         ATTCCTCTGT              ACTCCTCTCC        -802
consensus sequence                                    ACTCCTCCTT        -626
                                                      ATTTCTCGGC        -432

Estrogen consensus            GGTCANNNTGACC           GCTCCCGGTGACC     -746
sequence

RNA Polymerase III
  Box A                       RRYNNARYXGG             GACCTAGCTGG       -335
                                                      AGTGGAGTGGG       -388
  Box B                       GWTCRANNC               GTTCCAGAC         -260

Enhancer sequences:
  Interferon-β seq            GGAAATTCCTCTG           GGAAATTTCTTCC     -642
  CMV enhancer                GGAAAGTCCCGTT           GGAAAGTCCGGCT     -585

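
Consensus sequences written with degenerate symbols (R, Y, W, N) can be searched for in a sequence by expanding each symbol into the set of bases it stands for. The following Python sketch shows one way to do this with a regular expression; the function names are hypothetical, only the symbols defined in the table footnotes are handled, and the test sequences are the short PEM entries listed in Table 2 rather than the full sequence of Fig. 1.

```python
import re

# Degenerate symbols as defined in the footnotes (R, N, W, Y) plus plain bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "N": "[ACGT]",
}


def consensus_to_regex(consensus):
    """Expand a degenerate consensus into a regular expression string."""
    return "".join(IUPAC[symbol] for symbol in consensus.upper())


def find_sites(consensus, sequence):
    """Return 0-based start offsets of all (possibly overlapping) exact matches."""
    # A lookahead is used so that overlapping occurrences are all reported.
    pattern = re.compile("(?=(" + consensus_to_regex(consensus) + "))")
    return [match.start() for match in pattern.finditer(sequence.upper())]


if __name__ == "__main__":
    # SP1 consensus against the 15-base PEM entry listed for SP1.
    print(find_sites("GGGCGG", "GGGCGGGCGGGCGGG"))
    # Estrogen consensus against the PEM entry: no exact match is expected,
    # since the listed PEM sequences are homologous rather than identical.
    print(find_sites("GGTCANNNTGACC", "GCTCCCGGTGACC"))
```

Because several of the PEM entries differ from their consensus at one or more positions, an exact search of this kind would need to be relaxed to tolerate mismatches before it could recover every entry in the table.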
The sequence in Fig. 1 also includes two sites occurring in
the promoter region and in the first intron having 70 to
80% homology with the mammary consensus sequence (Rosen,
J.M. in "The Mammary Gland, Development, Regulation and
Function", Ed. Nevill, M.C. and Daniel, C.W., Plenum Press,
pp 301-322). These sites are set out in Table 3 below:

TABLE 3

Location          Sequence

                    ***      *  *
-289 to -274      AGGCTAAAACTAGAGC

                   *     ** **
+230 to +245      GTAAGAATTGCAGACA

Consensus         RGAAGRAAANTGGACA

Positions are numbered in accordance with Fig. 1.
* indicates a mismatch with the consensus sequence.
In the consensus sequence: R is A or G.
                           N is A, C, G or T.
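
The mismatch markers of Table 3 can be recomputed by comparing each site with the consensus position by position, treating R and N as sets of permitted bases. The Python sketch below does this for the second site; the function names are hypothetical and the approach is an illustration, not the procedure used to prepare the table.

```python
# Position-by-position comparison of a site with a degenerate consensus.
# Only the symbols used in the mammary consensus (A, C, G, T, R, N) are defined.

ALLOWED = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "N": "ACGT"}


def mismatch_positions(consensus, sequence):
    """Return 1-based positions where the sequence departs from the consensus."""
    if len(consensus) != len(sequence):
        raise ValueError("consensus and sequence must have the same length")
    pairs = zip(consensus.upper(), sequence.upper())
    return [i for i, (c, s) in enumerate(pairs, start=1) if s not in ALLOWED[c]]


def marker_line(consensus, sequence):
    """Build a line with '*' above each mismatching position."""
    bad = set(mismatch_positions(consensus, sequence))
    return "".join("*" if i in bad else " " for i in range(1, len(sequence) + 1))


if __name__ == "__main__":
    consensus = "RGAAGRAAANTGGACA"
    site = "GTAAGAATTGCAGACA"  # +230 to +245 site from Table 3
    print(marker_line(consensus, site))
    print(site)
    matches = len(site) - len(mismatch_positions(consensus, site))
    print(100.0 * matches / len(site))
```

The percentage printed is a plain count of matching positions; the homology figures quoted in the text may have been derived with a different convention.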



Preferred fragments according to the present
invention include the transcriptional and translational start
signals, "TATAA" box and at least one of the regulatory
elements (transcription factor binding sites) set out in
Table 2 above. More preferably these fragments contain 2 or
more, for instance 3, 4 or 5 of the regulatory elements in
addition to the TATAA box or even all of the regulatory
elements set out in Table 2. Those fragments containing more
than one of the regulatory elements of Table 2 preferably
also preserve the relative spacings of those sites from one
another and from the TATAA box and transcriptional and
translational start signals.

Other preferred fragments of the invention contain at
least one of the regions homologous to the mammary consensus
sequences as set out in Table 3. Preferably these fragments
contain both of the regions having homology with the mammary
consensus sequences as set out in Table 3. Those fragments
containing both regions having homology with the mammary
consensus sequence preferably also preserve the relative
spacing of those regions, as found in Fig. 1, from one
another and from the TATAA box and transcriptional and
translational start signals.

Yet further preferred fragments according to the
invention comprise the TATAA box, the transcriptional and
translational start signals, at least one and preferably two
or more of the regulatory elements as set out in Table 2 and
at least one and preferably both of the regions having
homology with the mammary consensus sequence as set out in
Table 3. Yet more preferably these fragments also preserve
the relative spacing of the features from Tables 1, 2 and 3.

Particularly preferred fragments according to the invention
comprise the sequence upstream of the TATAA box as set out
in Fig. 1 together with, and downstream thereof,
transcriptional and translational start signals and a
polypeptide coding sequence in correct reading frame
register with the promoter sequences and the TATAA box,
transcriptional and translational start signals. The coding
sequence may encode a part or parts of the polypeptide
encoded by the mucin gene, for instance a part or parts
thereof other than the tandem repeat sequence, or
polypeptides unrelated to that encoded by the mucin gene.

Other particularly preferred fragments according to
the present invention comprise promoter sequences, a TATAA
box, transcriptional and translational start signals and,
downstream thereof and in correct reading frame register
therewith, a coding sequence corresponding to a portion of
the mucin gene, for instance corresponding to the first exon
(corresponding to bases 1 to 130 of Fig. 1) or a part
thereof and/or the second exon (corresponding to bases 633
onwards in Fig. 1) or a part thereof, for instance a part
thereof other than the tandem repeat sequence as set out in
WO-A-88/05054.

In an especially preferred aspect the fragments
contain (i) the first 26 bases (bases 1 to 26 of Fig. 1) or
(ii) the whole of the first exon (bases 1 to 130 of Fig. 1)
and/or (iii) the splicing/ligating sites for the first intron
set out in Table 1 and a non-coding sequence between these
sites. The non-coding sequence may be the same as or
different from the sequence of the first intron as shown in
Fig. 1. Preferably it is the same.

Other preferred fragments of the invention comprise
at least a portion of the first intron (bases 131 to 632 of
Fig. 1). Further preferred fragments of the invention
comprise at least a portion of the 5'-flanking sequence
upstream of base -423 of Fig. 1.

Other preferred fragments of the invention comprise a
portion of the sequence of Fig. 1 corresponding to a portion
of the sequence of Fig. 3.

Further preferred fragments of the invention comprise 
a combination of any two or more of the foregoing preferred 
features. 

Fragments according to the present invention
containing functional coding sequences for at least a part of
the first or second exons set out in Fig. 1 are useful in the
production of polypeptides corresponding to a part or all of
the mucin gene product. Such polypeptides are, in turn,
useful as immunogenic agents, for instance in active
immunisation against Human Polymorphic Epithelial Mucin
(HPEM) for the prophylactic or therapeutic treatment of
cancers or raising antibodies for use in passive immunisation
and diagnosis of cancers. For use in such methods the
fragment, which codes for a polypeptide chain substantially
identical to a portion of the mucin core protein, may be
extended at either or both the 5' and 3' ends with further
coding or non-coding nucleic acid sequence including
regulatory and promoter sequences, marker sequences, and
splicing or ligating sites. Coding sequences may code for
other portions of the mucin core protein chain (for instance,
other than the tandem repeat) or for other polypeptide
chains. The fragment according to the invention, together
with any necessary or desirable flanking sequences, is
inserted, in an appropriate open reading frame register, into
a suitable vector such as a plasmid, or cosmid or a viral
genome (for instance vaccinia virus genome) and is then
expressed as a polypeptide product by conventional
techniques. In one aspect the polypeptide product may be
produced by culturing appropriate cells transformed with a
vector, harvested and used as an immunogen to induce active
immunity against the mucin core protein [Tartaglia et al.,
Tibtech, 6: 43 (1988)].

WO 91/09867                                   PCT/GB90/02020

Fragments according to the present invention
incorporating regulatory elements of Table 2 and/or mammary
consensus sequences of Table 3 may be used in securing
tissue-specific expression of functional coding sequences in
appropriate reading frame register downstream of the
regulatory elements and/or associated with the mammary
consensus sequences. Such fragments may therefore be used
to express parts or the whole of the mucin gene or any other
coding sequence in cells of epithelial origin. Applications
of this are in therapy and immunisation where such fragments
and associated coding sequences are administered to patients
such that the coding sequence will be expressed in
epithelial tissues leading to a therapeutic effect or an
immune reaction by the patient against the polypeptides.

The fragments may be presented as inserts in a vector
such as viral genomic nucleic acid and introduced into the
patients by inoculation of the vector, for instance as a
modified virus. The vector then directs expression of the
polypeptide in vivo and this in turn serves as a therapeutic
agent or as an immunogen to induce active immunity against
the polypeptide. This strategy may be adopted, for
instance, to secure expression of polypeptides encoded by
the HPEM gene for treatment or prophylaxis of
adenocarcinomas such as breast cancer or to secure tissue
specific expression of other peptides under control of the
regulatory sequences of Table 1, for instance by
administration of a modified vaccinia virus containing the
fragment and coding sequences in its genomic DNA. RNA
fragments of the invention may similarly be used by
administration via a retroviral vector. Selection of tissue
specific virus vectors to carry the fragments of the
invention and coding sequences will further restrict
expression of the polypeptide to desired target tissues.

Fragments of the invention may also be used to
control expression of oncogenic proteins in experimental
transgenic animals. Thus, for instance, a transgenic mouse
having an oncogene such as ras, erbB-2 or int-2 expressed
under control of the present tissue specific fragments may
develop breast tumours and be useful in testing diagnostic
agents such as tumour localisation and imaging agents and in
testing therapeutic agents such as immunotoxins.

Nucleic acid fragments according to the invention are also
useful as hybridisation probes for detecting the presence of
DNA or RNA of corresponding sequence in a sample. For use
as probes, fragments are preferably labelled with a
detectable label such as a radionuclide, enzyme label,
fluorescent label or other conventional directly or
indirectly detectable labels. For some applications, the
probes may be bound to a solid support. Labelling of the
probes may be achieved by conventional methods such as set
out in Matthews et al., Anal. Biochem.,
169: 1-25 (1988).

In further aspects, the present invention provides
cloning vectors and expression vectors containing fragments
according to the present invention. The vectors may be, for
instance, plasmids, cosmids or viral genomic DNA. The
present invention further provides host cells containing
such cloning and expression vectors, for instance epithelial
cells transformed with functional expression vectors
containing expressible fragments according to the invention.

The invention further provides nucleic acid fragments
which encode polypeptides as defined below; such fragments
may be fragments as hereinbefore defined. However, in view
of the redundancy of the genetic code, nucleic acid
sequences which differ slightly or substantially from the
sequence of Fig. 2 may nevertheless encode the same
polypeptide.
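
The effect of this redundancy can be illustrated by translating two different DNA sequences with the standard genetic code. The Python sketch below builds the standard codon table and translates two arbitrary nine-base sequences; the sequences and function name are hypothetical examples and are not taken from Fig. 2.

```python
from itertools import product

# Standard genetic code, laid out in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding-strand DNA sequence read in frame from its first base."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore an incomplete trailing codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


if __name__ == "__main__":
    first = "ATGACACCG"   # arbitrary example
    second = "ATGACTCCA"  # differs at two of nine bases
    print(translate(first), translate(second))
```

The two inputs differ at the third position of two codons yet give the same one-letter peptide sequence, which is the situation the paragraph above contemplates.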

The nucleic acid fragments of the invention may be
produced de novo by conventional nucleic acid synthesis
techniques or obtained from human epithelial cells by
conventional methods [Huynh et al., "DNA Cloning: A
Practical Approach", Glover, D.M. (Ed.), IRL, Oxford, Vol. 1,
pp 49-78 (1985)].

The invention therefore also provides probes, vectors and
transformed cells comprising nucleic acid fragments as
hereinbefore defined for use in methods of treatment of the
human or animal body by surgery or therapy and in diagnostic
methods practiced on the human or animal body and for use in
the preparation of medicaments for use in such methods. The
invention also provides methods for treatment of the human
or animal body by surgery or therapy and diagnostic methods
practiced in vivo as well as ex vivo and in vitro which
comprise administering such fragments, probes, vectors or
transformed cells in effective non-toxic amount to a human
or other mammal in need thereof.

Processes for producing fragments according to the
invention and probes, vectors and transformed cells
containing them and processes for expressing polypeptides
encoded by, or under the regulatory control of, fragments of
the invention also form aspects of the invention.

The invention further provides a polypeptide comprising a
sequence of at least 5 amino acid residues encoded by the
coding portion of the DNA sequence as indicated in Fig. 2.
Polypeptides according to the invention preferably have a
sequence of at least 10 residues, for instance at least 15,
more preferably 20 or more residues and most preferably all
the residues shown in Fig. 2.

The polypeptide may additionally comprise N-terminal
and/or C-terminal sequences not encoded by the DNA sequence
indicated by Fig. 2.

Polypeptides of the invention containing more than 5
amino acid residues encoded by the DNA sequence in Fig. 2
may include minor variations by way of substitution,
deletion or insertion of individual amino acid residues.
Preferably such polypeptides differ at not more than 20%,
preferably not more than 10% and most preferably not more
than 5% of residues in a contiguous portion corresponding to
a portion of the sequence in Fig. 2.

The invention further provides polypeptides as
defined above modified by addition of a linkage sugar such
as N-acetyl galactosamine on serine and/or threonine
residues and polypeptides modified by addition of
oligosaccharide moieties to N-acetyl galactosamine or via
other linkage sugars. Optionally modified polypeptides
linked to carrier proteins such as keyhole limpet
haemocyanin, albumen or thyroglobulin are also within the
invention.

Polypeptides according to the invention may be
produced de novo by synthetic methods or by expression of
the appropriate DNA fragments described above by recombinant
DNA techniques and expressed without glycosylation in human
or non-human cells. Alternatively they may be obtained by
deglycosylating native human mucin glycoprotein (which
itself may be produced by isolation from samples of human
tissue or body fluids or by expression and full processing
in a human cell line) [Burchell et al., Cancer Research, 47:
5467-5482 (1987); Gendler et al., P.N.A.S., 84: 6060-6064
(1987)], and digesting the core protein. The polypeptides
of the invention are useful in active immunisation of
humans, for raising antibodies in animals for use in passive
immunisation, diagnostic tests, tumour localisation and,
when used in conjunction with a cytotoxic agent, for tumour
therapy.

The invention further provides antibodies against
any of the polypeptides described above.

As used hereafter the term "antibody" is intended to
include polyclonal and monoclonal antibodies and fragments
of antibodies bearing antigen binding sites such as the
F(ab')2 fragments as well as such antibodies or fragments
thereof which have been modified chemically or genetically
in order to vary the amino acid residue sequence of one or
more polypeptide chains, to change the species specific
and/or isotype specific regions and/or to combine
polypeptide chains from different sources. Especially in
therapeutic applications it may be appropriate to modify the
antibody by coupling the Fab, or complementarity-determining
region thereof, to the Fc, or whole framework,
region of antibodies derived from the species to be treated
(e.g. such that the Fab region of mouse monoclonal
antibodies may be administered with a human Fc region to
reduce immune response by a human patient) or in order to
vary the isotype of the antibody (see EP-A-0 239 400). Such
antibodies may be obtained by conventional methods
[Williams, Tibtech, 6: 36 (1988)] and are useful in
diagnostic and therapeutic applications, such as passive
immunisation.

The term "antibodies" used herein is further intended
to encompass antibody molecules or fragments thereof as
defined above produced by recombinant DNA techniques as well
as so-called "single domain antibodies" or "dAbs" such as
are described by Ward, E.S. et al., Nature, 341: 544-546
(1989) which are produced in recombinant microorganisms,
such as Escherichia coli, harboring expressible DNA
sequences derived from the DNA encoding the variable domain
of an immunoglobulin heavy chain by random mutation
introduced, for instance, during polymerase chain reaction
amplification of the original DNA. Such dAbs may be
produced by screening a library of such randomly mutated DNA
sequences and selecting those which enable expression of
polypeptides capable of specifically binding the
polypeptides of the invention or HPEM core protein.

Antibodies according to the present invention react
with HPEM core protein, especially as expressed by colon,
lung, ovary and particularly breast carcinomas, but have
reduced or no reaction with corresponding fully processed
HPEM. In a particular aspect the antibodies react with HPEM
core protein but not with fully processed HPEM glycoprotein
as produced by the normal lactating human mammary gland.

Antibodies according to the present invention
preferably have no significant reaction with the mucin
glycoproteins produced by pregnant or lactating mammary
epithelial tissues but react with the mucin proteins
expressed by mammary epithelial adenocarcinoma cells. These
antibodies show a much reduced reaction with benign breast
tumours and are therefore useful in diagnosis and
localisation of breast cancer as well as in therapeutic
methods.

Further uses of the antibodies include diagnostic
tests or assays for detecting and/or assessing the severity
of breast, colon, ovary and lung cancers.

The antibodies may be used for other purposes
including screening cell cultures for the polypeptide
expression product of the human mammary epithelial mucin
gene, or fragments thereof, particularly the nascent
expression product. In this case the antibodies may
conveniently be polyclonal or monoclonal antibodies.

The invention further provides antibodies linked to
therapeutically or diagnostically effective ligands. For
therapeutic use of the antibodies the ligands are lethal
agents to be delivered to cancerous breast or other tissue
in order to incapacitate or kill transformed cells. Lethal
agents include toxins, radioisotopes and "direct killing
agents" such as components of complement as well as
cytotoxic or other drugs.

For diagnostic applications the antibodies may be
linked to ligands such as solid supports and detectable
labels such as enzyme labels, chromophores, fluorophores and
radioisotopes and other directly or indirectly detectable
labels. Preferably monoclonal antibodies are used in
diagnosis.

Antibodies according to the present invention may be
produced by inoculation of suitable animals with a
polypeptide as hereinbefore described. Monoclonal
antibodies are produced by known methods, for instance by
the method of Kohler & Milstein [Nature, 256: 495-497
(1975)] by immortalising spleen cells from an animal
inoculated with the mucin core protein or a fragment
thereof, usually by fusion with an immortal cell line
(preferably a myeloma cell line), of the same or a different
species as the inoculated animal, followed by the
appropriate cloning and screening steps.

Antibody-producing cells obtained from animals
inoculated with polypeptides of the invention, and
such cells when immortalised, form further aspects of the
invention.

The invention further provides polypeptides, 
antibodies and antibody producing cells, such as hybridomas, 
as hereinbefore defined for use in methods of surgery, 
therapy or diagnosis practiced on the human or animal body 
or for use in the production of medicaments for use in such 
methods. The invention also provides a method of treatment 
or diagnosis which comprises administering an effective 
non-toxic amount of a polypeptide or antibody as 
hereinbefore described to a human or animal in need thereof. 

Processes for producing polypeptides according to the
invention, whether by expression of nucleic acid fragments of
the invention or otherwise, and for producing antibodies or
fragments thereof and for producing antibody-producing cells
such as immortalised cells, form further aspects of the
invention.

The invention further provides a diagnostic test or
assay method comprising contacting a sample suspected to
contain abnormal human mucin glycoproteins with an antibody
as defined above. Such methods include tumour localisation
involving administration to the patient of the antibody
bearing detectable label or administration of an antibody
and, separately, simultaneously or sequentially in either
order, administering a labelling entity capable of
selectively binding the antibody or fragment thereof.

Diagnostic test kits are provided for use in diagnostic
tests or assays and comprise antibody and, optionally,
suitable labels and other reagents and, especially for use
in competitive assays, standard sera.

The invention will now be illustrated with reference
to the figures of the accompanying drawings in which:

Fig. 1 shows the deoxynucleotide base sequence of the 1763
bases upstream of and including the first SmaI restriction
site in the tandem repeat sequence of WO-A-88/05054 using
the conventional symbols A, C, G and T for the bases of the
non-template strand. The base sequence is arranged in
blocks of ten. Untranscribed sequence is in lower case,
transcribed sequence is in upper case. The SP1 regulatory
elements (Table 2), TATAA box, transcriptional and
translational start sites (Table 1) are underlined.

Fig. 2 shows the sequence of the non-template strand
commencing from the transcriptional start site (residue 1
in Fig. 1) and excluding the sequence of the first intron
(bases 131 to 632 of the sequence in Fig. 1). Fig. 2 also
shows the predicted sequence of the polypeptide using the
conventional 1 letter symbols for the amino acid residues.
Amino acid residues are numbered down the left-hand side and
nucleotide bases down the right-hand side. The signal
sequence is underlined. The sequences end at the first SmaI
site in the tandem repeat.
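
The relationship between the two figures, in which Fig. 2 omits bases 131 to 632 of Fig. 1, amounts to removing a segment identified by 1-based, inclusive coordinates. A minimal Python sketch of that operation follows; the function name is hypothetical and the inputs are placeholders, since the sequence of Fig. 1 is not reproduced here.

```python
# Remove a segment given by 1-based inclusive coordinates, as used in the text
# for the first intron (bases 131 to 632). Inputs below are placeholders only.


def remove_segment(sequence, start, end):
    """Return the sequence with bases start..end (1-based, inclusive) removed."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("coordinates must satisfy 1 <= start <= end <= length")
    return sequence[:start - 1] + sequence[end:]


if __name__ == "__main__":
    # Toy example: remove bases 4 to 6.
    print(remove_segment("AAACCCGGG", 4, 6))
    # A placeholder of 1000 bases loses 632 - 131 + 1 = 502 bases.
    placeholder = "N" * 1000
    print(len(remove_segment(placeholder, 131, 632)))
```

With these coordinates, base 130 (the last base of the first exon) becomes directly adjacent to base 633 (the first base of the second exon).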

Fig. 3 shows the deoxynucleotide base sequence of the 1575
bases upstream of and including the first SmaI restriction
site in the tandem repeat sequence of WO-A-88/05054 using
the conventional symbols A, C, G and T for the bases of the
non-template strand. The base sequence is arranged in
blocks of ten in non-coding regions. The exon sequences are
shown in blocks of three and translated codons are
underlined. The start positions of exons 1 and 2, intron 1
and the signal sequence for exon splicing are numbered and
labelled. Other features mentioned in Tables 1 and 2 are
boxed. The sequence finishes with the first SmaI site of
the tandem repeat sequence.



The present invention does not extend to fragments,
polypeptides and antibodies or related materials such as
vectors and cells, which are specifically disclosed in
WO-A-88/05054 or WO-A-90/05142, nor to the cDNA fragment whose
sequence is indicated in Abe, M. et al., Biochemical and
Biophysical Research Communications, 165(2): 644-649 (1989).

The invention will now be illustrated by the 
following Examples: 

EXAMPLE 1 

In an attempt to obtain clones with 5' unique
sequences, two λgt10 libraries were screened with a probe
for the tandem repeat. All the clones obtained lacked any
non-repetitive sequence at the 5' terminus. Thus, a
different strategy was adopted: the cDNA corresponding to
the 5' end of a breast cancer cell line transcript was
synthesized using the anchored polymerase chain reaction
(A-PCR) procedure [Loh, E.Y. et al., Science, 243: 217-220
(1989)]. For the 5' end clones, total RNA (5 µg) from the
breast cancer cell line BT20, prepared by the guanidinium
isothiocyanate method [Chirgwin, J.M. et al., Biochem., 18:
5294-5299 (1979)], was used for first strand synthesis
with AMV reverse transcriptase (Life Sciences) in a 40 µl
reaction mixture
[Okayama, H. and Berg, P., Mol. Cell. Biol., 2: 161-170
(1982)] containing 1 W of an oligonucleotide primer made to
the tandem repeat (5'CCAAGCTTGGAGCCCGGGGCCGGCCTGGTGTCCGG3').
The total RNA was subjected to reverse transcription, the
products were precipitated with spermine, and a poly(dG)
tail was introduced with terminal deoxynucleotidyl
transferase (500 U/ml, Pharmacia). Amplification was
performed with Thermus aquaticus polymerase (Perkin Elmer
Cetus) in 100 µl of the standard buffer supplied. The
primers included the tandem repeat primer and, for the
poly(dG) end, a mixture of the AN polyC primer
(5'GCATGCGCGCGGCCGCGGAGGCCCCCCCCCCCCCC3') and the AN primer
(5'GCATGCGCGCGGCCGCGGAGGCC3') at a ratio of 1:9. Following
an initial denaturation at 94°C for 5 min, the reaction was
annealed at 55°C for 2 min, extended at 72°C for 2.5 min
and denatured at 94°C for 1.5 min. Amplification was
performed for 30 cycles, and the product was precipitated
with ethanol. The DNA was sequentially cut with HindIII and
SacII, separated on a 1.2% agarose gel, and the band of
approximately 550 bp was purified onto DEAE membrane
(Schleicher and Schuell), ligated into pBS-SK+ and
transformed into bacteria XL-1 (Stratagene). This plasmid
will be referred to as pBS-S-PEM. All restriction enzymes
used were obtained from New England Biolabs Inc.;
oligonucleotide primers and probes were synthesized on an
Applied Biosystems 380B DNA synthesizer.
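The cloning step relies on restriction sites carried by the primers themselves. The following Python sketch, offered only as an illustration, scans the three primer sequences quoted above for the HindIII, SacII and SmaI recognition sequences. The recognition sequences (AAGCTT, CCGCGG, CCCGGG) are standard values supplied here and are not stated in the text; the names `SITES`, `PRIMERS` and `find_sites` are likewise assumptions.

```python
# Standard recognition sequences (assumed here; not given in the text).
SITES = {"HindIII": "AAGCTT", "SacII": "CCGCGG", "SmaI": "CCCGGG"}

# Primer sequences as quoted in Example 1, written 5' to 3'.
PRIMERS = {
    "tandem_repeat": "CCAAGCTTGGAGCCCGGGGCCGGCCTGGTGTCCGG",
    "AN_polyC": "GCATGCGCGCGGCCGCGGAGGCCCCCCCCCCCCCC",
    "AN": "GCATGCGCGCGGCCGCGGAGGCC",
}


def find_sites(seq: str) -> dict:
    """Map each enzyme name to the 1-based start positions of its site in seq."""
    hits = {}
    for enzyme, site in SITES.items():
        positions = []
        idx = seq.find(site)
        while idx != -1:
            positions.append(idx + 1)        # report 1-based positions
            idx = seq.find(site, idx + 1)    # allow overlapping matches
        hits[enzyme] = positions
    return hits


for name, seq in PRIMERS.items():
    print(name, len(seq), find_sites(seq))

# The AN polyC primer should be the AN primer followed by a run of C's.
an, polyc = PRIMERS["AN"], PRIMERS["AN_polyC"]
tail = polyc[len(an):]
print(polyc.startswith(an), set(tail))
```

Under these assumptions the tandem repeat primer carries a HindIII site near its 5' end and the AN primers carry a SacII site, which is consistent with the sequential HindIII and SacII digestion used before ligation.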

Four colonies were selected for sequencing, and the
sequences agreed with each other and with sequence obtained
from genomic clones of the region. A leader sequence of 72
bp preceded the first ATG, which was in-frame with the
reading frame of the tandem repeat as previously determined
(Fig. 1), and the sequence surrounding the first ATG,
CCACCATGA, agrees with the Kozak consensus sequence [Kozak,
M., Nucl. Acids Res., 12: 857-872 (1984)].
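The agreement with the Kozak consensus can be checked mechanically. The sketch below assumes the consensus is written CC(A/G)CCATGG, as it is commonly quoted; that pattern is an assumption introduced here, and only the nine bases CCACCATGA come from the text. The variable names are illustrative.

```python
import re

# Assumed core consensus up to the start codon: CC(A/G)CC followed by ATG.
KOZAK_CORE = re.compile(r"CC[AG]CCATG")

context = "CCACCATGA"            # sequence around the first ATG (from the text)
match = KOZAK_CORE.search(context)

atg = context.find("ATG")        # 0-based index of the A of ATG
minus3 = context[atg - 3]        # base at position -3 relative to the A (+1)
plus4 = context[atg + 3]         # base at position +4

print(bool(match), minus3, plus4)
```

On this reading the upstream bases match the consensus, including the purine at position -3, while position +4 is A rather than the G of the full consensus.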

The primer extension technique was used to map
precisely the position of the cap site. A 21 bp
oligonucleotide primer (5'AGACTGGGTGCCCGGTGTCAT3')
corresponding to nucleotides 73 to 93 ending at the A of ATG
(Fig. 1) was end-labelled with [γ-32P]ATP (> 5000 Ci/mmol,
Amersham International plc) using T4 polynucleotide kinase
(Pharmacia) and precipitated three times with equal volumes
of 4 M ammonium acetate to remove free [γ-32P]ATP from the
kinased oligonucleotide. Labelled primer (1 x 10^5 dpm at
1 x 10^7 dpm/pmole) was annealed to 40 µg of total BT20 RNA
in 120 mM sodium chloride at 95°C for 5 min, held at 65°C
for 1 h and cooled to room temperature. The annealed primer
was extended using 18 units of reverse transcriptase in
50 mM Tris pH 8.3 at 45°C, 6 mM magnesium acetate, 10 mM
dithiothreitol, 1.8 mM dNTPs in a total volume of 50 µl at
45°C for 1 h. The reaction was stopped by the addition of
50 mM EDTA and the RNA digested by treatment with RNase-A at
400 µg/ml for 15 min at 37°C. The samples were then
phenol:chloroform extracted prior to ethanol precipitation
and electrophoresed on a standard 6% sequencing gel, yielding
two bands which mapped to two C's, 72 and 71 bases upstream
of the ATG. The sequencing ladder was single-stranded
control DNA (M13mp18) from the Sequenase kit (US Biochemical
Corp.).
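The primer is complementary to the non-template strand, so its reverse complement should read as the start of the coding sequence. The Python sketch below, given only as an illustration, checks this and also works out the amount of primer implied by the stated radioactivity and specific activity. The function and variable names are assumptions made here; the primer sequence and the numerical values are taken from the text.

```python
# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


PRIMER = "AGACTGGGTGCCCGGTGTCAT"     # primer extension primer, 5' to 3'
target = reverse_complement(PRIMER)  # matching stretch of the non-template strand

# Nucleotides 73 to 93 inclusive should span the same number of bases as the primer.
span = 93 - 73 + 1

# Amount of labelled primer: total dpm divided by specific activity (dpm/pmole).
primer_pmol = 1e5 / 1e7

print(target, len(PRIMER), span, primer_pmol)
```

The reverse complement begins with ATG, in keeping with a primer whose 3' end pairs with the A of the initiation codon at nucleotide 73, and the stated activities correspond to roughly 0.01 pmole of primer per reaction.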

The most prominent product was 72 bp, equal to the
number of base pairs from the 5' end of the oligonucleotide
primer to the 5' end of the PCR-derived clone, thus
confirming that the cDNA represents the entire length of its
corresponding cellular mRNA 5' to the tandem repeat. The
presence of a second band may be due to interference with
reverse transcriptase by methylation of the C at base 71,
since it forms a CpG dinucleotide. Under identical
conditions, no primer extension product was seen using RNA
from Daudi cells, which do not express the PEM mucin.

Cloning

A plasmid library, grown in DH1 cells (RecA-), was
used instead of a lambda library, because of the possibility
of recombination occurring when lambda is grown in RecA+
cells. This recombination might have been expected, since
a part of the tandem repeat sequence (GCTGGGGG) is closely
related to the chi sequence (GCTGGTGG) of lambda phage, which
has been implicated as a hotspot for RecA-mediated
recombination in E. coli.
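How closely the two octamers are related can be made explicit by counting mismatches. The following Python sketch is illustrative only; the function name `hamming` and the variable names are assumptions, and the two sequences are those quoted above.

```python
def hamming(a: str, b: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


REPEAT_MOTIF = "GCTGGGGG"   # part of the tandem repeat sequence
CHI = "GCTGGTGG"            # chi sequence of lambda phage

# List each mismatch as (1-based position, repeat base, chi base).
mismatches = [
    (i + 1, x, y)
    for i, (x, y) in enumerate(zip(REPEAT_MOTIF, CHI))
    if x != y
]

print(hamming(REPEAT_MOTIF, CHI), mismatches)
```

The two sequences differ at a single position out of eight (G in the repeat, T in chi, at position 6).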
Nucleotide sequence of cDNA clones

Fig. 1 shows the DNA sequence from the 5'
A-PCR-derived clone, including the consensus sequence of the
tandem repeat. Sequences were determined in both
directions. The region of conserved tandem repeats was not
sequenced in full, although a cDNA tandem repeat clone
obtained previously had been circularised, sonicated and
about 40 clones sequenced [Gendler et al., J. Biol. Chem.,
263: 12820-12823 (1988)].
Predicted amino acid sequence and composition of the PEM
core protein

The core protein amino acid composition is dominated
by the amino acid composition of the tandem repeat. Serine,
threonine, proline, alanine and glycine account for about
60% of the amino acids.

The deduced sequence of the PEM core protein consists
of distinct regions including (1) the N-terminal region
containing a hydrophobic signal sequence and degenerate
tandem repeats and (2) the tandem repeat region itself.
At the N-terminus a putative signal peptide of 13 amino
acids follows the first 7 amino acids. However, the actual
site of cleavage has not been determined as attempts to
obtain N-terminal sequence of the core protein were hindered
by a blocked amino terminus. Following the signal sequence
and preceding the first SmaI site (which is used to define
the beginning of the tandem repeat region) are 107 amino
acids. Greater than 50% of these amino acids comprise
degenerate tandem repeats. Since the number of tandem
repeats per molecule is large (greater than 21 for the
smallest allele we have observed), this domain forms the
major part of the core protein, and results in a highly
repetitive structure which is extremely immunogenic
[Gendler, S. et al., loc. cit.]. The sequence of the 20
amino acid tandem repeat unit corresponds to what might be
expected for a protein which is extensively O-glycosylated.
Five serines and threonines, four of which are in doublets,
are found in the repeat and these potential glycosylation
sites are separated by regions rich in prolines (see Fig.
2).
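The serine and threonine count can be illustrated on a repeat unit. The 20-residue sequence used below, PDTRPAPGSTAPPAHGVTSA, is the repeat unit as commonly reported in the literature; it is not reproduced in this section, so it is an assumption here and should be checked against Fig. 2. The variable names are illustrative.

```python
# ASSUMPTION: repeat unit as commonly reported in the literature,
# not taken from this section; verify against Fig. 2 before relying on it.
REPEAT = "PDTRPAPGSTAPPAHGVTSA"

# 1-based positions of serine (S) and threonine (T) residues.
st_positions = [i + 1 for i, aa in enumerate(REPEAT) if aa in "ST"]

# A residue is "in a doublet" if a neighbouring position is also S or T.
in_doublets = [
    p for p in st_positions
    if (p - 1) in st_positions or (p + 1) in st_positions
]

proline_count = REPEAT.count("P")

print(len(REPEAT), st_positions, in_doublets, proline_count)
```

With this assumed sequence the count gives five serine and threonine residues, four of them in two doublets, with prolines lying between the potential glycosylation sites, which matches the description above.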
CLAIMS



1. A nucleic acid fragment comprising a portion of at
least 17 contiguous nucleotide bases which portion has a
sequence the same as, or homologous to a portion of
corresponding length of the sequence of the coding strand as
set out in Fig. 1 or the same as, or homologous to a portion
of corresponding length of the sequence complementary to the
sequence of the coding strand set out in Fig. 1.



2. A fragment according to claim 1 comprising any one or
more of the following:

(a) a signal sequence
TTCCTGCTGCTGCTCCTCACAGTGCTTACAGXTTGTT
wherein X is an optionally present intron

(b) a mammary consensus sequence AGGCTAAAACTAGACC

(c) a mammary consensus sequence GTAAGAATTGCAGACA

(d) a homologue of a sequence (a), (b) or (c) and

(e) a sequence complementary to a sequence (a), (b), (c)
or (d).



3. A hybridisation probe comprising a fragment according
to claim 1 or claim 2 bearing a detectable label or linked to
a solid support.



4. A cloning or expression vector comprising a fragment
according to claim 1 or claim 2.

5. A transformed cell comprising a cloning or expression 
vector according to claim 4. 

6. A polypeptide comprising a sequence of at least 5
contiguous amino acid residues encoded by the coding portion
of the DNA sequence as indicated in Fig. 2.

7. An antibody against a polypeptide according to claim 
6. 

8. An antibody according to claim 7 bearing a detectable
label or linked to a solid support.

9. An antibody-producing cell capable of secreting an 
antibody according to claim 7. 

10. A diagnostic kit comprising a fragment according to
claim 1 or claim 2 or a probe according to claim 3 or a
polypeptide according to claim 6 or an antibody according to
claim 7 or claim 8.

11. A fragment according to claim 1 or claim 2 or a probe
according to claim 3 or a vector according to claim 4 or a
cell according to claim 5 or claim 9 or a polypeptide
according to claim 6 or an antibody according to claim 7 or
claim 8 for use in a method of treatment or diagnosis
practised on the human or animal body.

12. Use of a fragment according to claim 1 or claim 2 or
a probe according to claim 3 or a vector according to claim 4
or a cell according to claim 5 or claim 9 or a polypeptide
according to claim 6 or an antibody according to claim 7 or
claim 8 in the preparation of a medicament for use in a
method of treatment or diagnosis practised on the human or
animal body.



13. A method of treatment or diagnosis comprising
administering to a cancer patient in need thereof or
suspected to have a cancer an effective non-toxic amount of a
fragment according to claim 1 or claim 2 or a probe according
to claim 3 or a vector according to claim 4 or a cell
according to claim 5 or claim 9 or a polypeptide according to
claim 6 or an antibody according to claim 7 or claim 8.

14. A method of diagnosis comprising contacting a sample
from a patient with a fragment according to claim 1 or claim
2 or a probe according to claim 3 or a vector according to
claim 4 or a cell according to claim 5 or claim 9 or a
polypeptide according to claim 6 or an antibody according to
claim 7 or claim 8.